Abstract

We conduct an in-depth exploration of different strategies for event
detection in videos using convolutional neural networks (CNNs) trained
for image classification. We study different ways of performing spatial
and temporal pooling, feature normalization, choice of CNN layers, and
choice of classifiers. Making judicious choices along these dimensions
leads to a very significant increase in performance over the more naive
approaches used until now. We evaluate our approach on the challenging
TRECVID MED’14 dataset with two popular CNN architectures pretrained on
ImageNet. On this dataset, our methods, based entirely on image-trained
CNN features, outperform several state-of-the-art non-CNN models. Our
proposed late fusion of CNN- and motion-based features further increases
the mean average precision (mAP) on MED’14 from 34.95% to 38.74%. The
fusion approach also achieves state-of-the-art classification
performance on the challenging UCF-101 dataset.


1. Introduction 

The huge volume of videos that are nowadays routinely produced by
consumer cameras and shared across the web calls for effective video
classification and retrieval approaches. One straightforward approach
considers a video as a set of images and relies on techniques designed
for image classification. This standard image classification pipeline
consists of three processing steps: first, multiple carefully engineered
local feature descriptors (e.g., SIFT [2 ] or SURF [2]) are extracted;
second, the local feature descriptors are encoded using the bag-of-words
(BoW) [4] or Fisher vector (FV) [2 ] representation; and finally, a
classifier (e.g., a support vector machine (SVM)) is trained on top of
the encoded features. The main limitation of directly applying the
standard image classification approach to video is that it does not
exploit motion information. This shortcoming has been addressed by
extracting optical flow based descriptors (e.g., [2 ]), descriptors from
spatiotemporal interest points (e.g., [17, 5, 38, 18, 37]) or
descriptors along estimated motion trajectories (e.g., [34, 13, 11, 35,
36]).

The success of convolutional neural networks (CNNs) [20] on ImageNet [7]
has attracted considerable attention in the computer vision research
community. Since the publication of the winning model [15] of the
ImageNet 2012 challenge, CNN-based approaches have been shown to achieve
state-of-the-art results on many challenging image datasets. For
instance, the trained model in [15] has been successfully transferred to
the PASCAL VOC dataset and used as a mid-level image representation [8].
Such transferred representations showed promising results in object and
action classification and localization [24]. An evaluation of
off-the-shelf CNN features applied to visual classification and visual
instance retrieval has been conducted on several image datasets [28].
The main observation was that the CNN-based approaches outperformed the
approaches based on the most successful hand-designed features.

Recently, CNN architectures trained on videos have emerged, with the
objective of capturing and encoding motion information. The 3D CNN model
proposed in [12] outperformed baseline methods in human action
recognition. The two-stream convolutional network proposed in [29]
combined a CNN model trained on appearance frames with a CNN model
trained on stacked optical flow features to match the performance of
hand-crafted spatiotemporal features. A CNN combined with a long
short-term memory (LSTM) model has been used in [23] to obtain
video-level representations.
Figure 1. Overview of the proposed video classification pipeline.


In this paper, we propose an efficient approach to exploit off-the-shelf
image-trained CNN architectures for video classification (Figure 1). Our
contributions are the following:

• We discuss each step of the proposed video classification pipeline,
including the choice of CNN layers, the video frame sampling and
calibration, the spatial and temporal pooling, the feature
normalization, and the choice of classifier. All our design choices are
supported by extensive experiments on the TRECVID MED’14 video dataset.

• We provide thorough experimental comparisons between the CNN-based
approach and some state-of-the-art static and motion-based approaches,
showing that the CNN-based approach can outperform the latter, both in
terms of accuracy and speed.

• We show that integrating motion information with a simple average
fusion considerably improves classification performance, achieving
state-of-the-art performance on TRECVID MED’14 and UCF-101.

Our work is closely related to other research efforts towards the
efficient use of CNNs for video classification. While it is now clear
that CNN-based approaches outperform most state-of-the-art handcrafted
features for image classification [28], it is not yet obvious that this
holds true for video classification. Moreover, there seem to be mixed
conclusions regarding the benefit of training a spatiotemporal CNN
versus applying an image-trained CNN architecture to videos. Indeed,
while Ji et al. [12] observed a significant gain using 3D convolutions
over 2D CNN architectures and Simonyan et al. [29] obtained substantial
gains over an appearance-based 2D CNN using optical flow features alone,
Karpathy et al. [14] reported only a moderate improvement. Although the
specificity of the considered video datasets might play a role, the way
the 2D CNN architecture is exploited for video classification is
certainly the main reason behind these contradictory observations. The
additional computational cost of training on videos should also be taken
into account when comparing the two options. Before training a
spatiotemporal CNN architecture, it thus seems legitimate to fully
exploit the potential of image-trained CNN architectures. Because our
results were obtained on a highly heterogeneous video dataset, we
believe that they can serve as a strong 2D CNN baseline against which to
compare CNN architectures specifically trained on videos.

2. Deep Convolutional Neural Networks 

Convolutional neural networks [20] consist of layers of
spatially-structured hidden units. Each hidden unit typically looks at a
small patch of hidden (or input) units in the previous layer, applies a
convolution or pooling operation to it, and then applies a non-linearity
to the result to compute its own state. A spatial patch of units is
convolved with multiple filters (learned weights) to generate feature
maps. A pooling operation takes a spatial patch and typically computes
the maximum or average activation (max or average pooling) of that patch
for each input channel, creating translation invariance. At the top of
the stacked layers, the spatially organized hidden units are usually
densely connected, and these dense layers eventually connect to the
output units (e.g., softmax for classification). Regularizers, such as
$\ell_2$ decay and dropout [32], have been shown to be effective for
preventing overfitting. The structure of deep neural nets enables the
hidden layers to learn rich distributed representations of the input
images. The densely connected layers, in particular, can be seen as
learning a high-level representation of the image that accumulates
information from all the spatial locations.

We initially developed our work on the CNN architecture by Krizhevsky et
al. [15]. In the recent ILSVRC-2014 competition, the top-performing
models GoogLeNet [33] and VGG [30] used smaller receptive fields and
increased depth, showing superior performance over [15]. In our system,
we use the popular model of Krizhevsky et al. together with the publicly
available pretrained VGG model, the latter chosen for its superior
post-competition single-model performance over GoogLeNet. The proposed
video classification approach is generic with respect to the CNN
architecture and can therefore be adapted to other CNN architectures as
well.

3. Video Classification Pipeline 

Figure 1 gives an overview of the proposed video classification
pipeline. Each component of the pipeline is discussed in this section.

Choice of CNN Layer. We have considered the output layer and the last
two hidden layers (fully-connected layers) as CNN-based features. The
1,000-dimensional output-layer features, with values in [0, 1], are the
posterior probability scores corresponding to the 1,000 classes of the
ImageNet dataset. Since our events of interest are different from these
1,000 classes, the output-layer features are rather sparse (see Figure
2, left panel). The hidden-layer features are treated as high-level
image representations in our video classification pipeline. These
features are outputs of rectified linear units (ReLUs). Therefore, they
are lower bounded by zero but do not have a pre-defined upper bound (see
Figure 2, middle and right panels) and thus require some normalization.

Video Frame Sampling and Calibration. We uniformly sampled 50 to 120
frames per video, depending on the clip length. We have explored
alternative frame sampling schemes (e.g., based on keyframe detection),
but found that they all yield essentially the same performance as
uniform sampling. The first layer of the CNN architecture takes 224 ×
224 RGB images as inputs. We extracted multiple patches from each frame
and rescaled each of them to a size of 224 × 224. The number, location
and original size of the patches are determined by the selected spatial
pooling strategy. Each of the P patches is then used as input to the CNN
architecture, providing P different outputs for each video frame.
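As an illustration of uniform sampling, the following minimal sketch
picks evenly spaced frame indices from a clip. The function name
`uniform_frame_indices` and the choice of taking the centre of each
equal-length segment are assumptions made for this example; the text
does not specify how the number of frames (50 to 120) is derived from
the clip length, so it is left as an argument here.

```python
import numpy as np


def uniform_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Return up to `num_samples` frame indices spread evenly over a clip."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Never ask for more frames than the clip contains.
    num_samples = min(num_samples, num_frames)
    # Split the clip into equal-length segments and take each segment centre.
    edges = np.linspace(0, num_frames, num_samples + 1)
    centres = (edges[:-1] + edges[1:]) / 2.0
    return np.floor(centres).astype(int)
```

For a 100-frame clip and 4 samples, this returns the indices 12, 37, 62
and 87.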

Spatial and Temporal Pooling. We evaluated two pooling schemes, average
and max, for both spatial and temporal pooling, along with two spatial
partition schemes. Inspired by the spatial pyramid approach developed
for bags of features [19], we spatially pooled together the CNN features
computed on patches centered in the same pre-defined sub-region of the
video frames. As illustrated in Figure 3, 10 overlapping square patches
are extracted from each frame. We have considered 8 different regions
consisting of 1 × 1, 3 × 1 and 2 × 2 partitions of the video frame. For
instance, features from patches 1, 2 and 3 are pooled together to yield
a single feature in region 2. This region specification covers the full
frame, the vertical structure and the four corners of a frame. The
pooled features from the 8 spatial regions are concatenated into one
feature vector. Our spatial pyramid approach differs from that of [10,
9, 39] in that it is applied to the output- or hidden-layer feature
responses of multiple inputs rather than between the convolutional layer
and the fully-connected layer. The pre-specified regions are also
different. We have also considered using objectness to guide feature
pooling by concatenating the CNN features extracted from the foreground
region and from the full frame. As shown in Figure 3, the foreground
region is obtained by thresholding the sum of 1000 objectness proposals
generated by BING [3]. Unlike R-CNN [8], which extracts CNN features
from all objectness proposals for object detection and localization, we
used only one coarse foreground region to reduce the computational cost.
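The spatial pyramid and temporal pooling described above can be sketched
as follows. This is only an illustration under stated assumptions: the
patch-to-region table `SP8_REGIONS` is hypothetical (the text only
states that patches 1, 2 and 3 share one region), the 3 × 3 layout of
the nine small patches is assumed, and `pool_video` is a name introduced
for this example.

```python
import numpy as np

# Hypothetical patch-to-region assignment, using 0-based patch indices.
# Patches 0-8 are assumed to lie row by row on a 3 x 3 grid; patch 9 is the
# full-height patch. Only the grouping [0, 1, 2] is stated in the text.
SP8_REGIONS = [
    list(range(10)),  # 1 x 1: whole frame (assumed to use every patch)
    [0, 1, 2],        # 3 x 1: first band
    [3, 4, 5],        # 3 x 1: second band
    [6, 7, 8],        # 3 x 1: third band
    [0, 1, 3, 4],     # 2 x 2: upper-left (assumed)
    [1, 2, 4, 5],     # 2 x 2: upper-right (assumed)
    [3, 4, 6, 7],     # 2 x 2: bottom-left (assumed)
    [4, 5, 7, 8],     # 2 x 2: bottom-right (assumed)
]


def pool_video(features, regions=SP8_REGIONS, mode="max"):
    """Pool (T frames, P patches, D dims) features into one video vector.

    Returns a vector of length len(regions) * D.
    """
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    features = np.asarray(features, dtype=float)
    reduce_fn = np.max if mode == "max" else np.mean
    pooled = []
    for patch_ids in regions:
        # Spatial pooling over the patches of this region: (T, D).
        spatial = reduce_fn(features[:, patch_ids, :], axis=1)
        # Temporal pooling over the sampled frames: (D,).
        pooled.append(reduce_fn(spatial, axis=0))
    # Concatenate the per-region features into one video-level vector.
    return np.concatenate(pooled)
```

With 4,096-dimensional hidden-layer features and 8 regions, the output
has 32,768 dimensions, which matches the SP8 feature size reported in
the experiments.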

Feature Normalization. We compared different normalization schemes and
identified the normalization most appropriate for each feature type. Let
$\mathbf{f} \in \mathbb{R}^D$ denote the original video-level CNN
feature vector and $\tilde{\mathbf{f}} \in \mathbb{R}^D$ its normalized
version. We have investigated three different normalizations: $\ell_1$
normalization, $\tilde{\mathbf{f}} = \mathbf{f} / \|\mathbf{f}\|_1$;
$\ell_2$ normalization, $\tilde{\mathbf{f}} = \mathbf{f} /
\|\mathbf{f}\|_2$, which is typically performed prior to training an SVM
model; and root normalization, $\tilde{\mathbf{f}} = \sqrt{\mathbf{f} /
\|\mathbf{f}\|_1}$, introduced in [1] and shown to improve the
performance of SIFT descriptors.
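The three normalizations can be written in a few lines. The sketch below
is illustrative only; the function names and the small constant `EPS`
(added to avoid division by zero for all-zero vectors) are assumptions
of this example, not part of the described system.

```python
import numpy as np

EPS = 1e-12  # assumed guard against division by zero for all-zero vectors


def l1_normalize(f):
    """Divide by the l1 norm (sum of absolute values)."""
    f = np.asarray(f, dtype=float)
    return f / (np.abs(f).sum() + EPS)


def l2_normalize(f):
    """Divide by the l2 (Euclidean) norm."""
    f = np.asarray(f, dtype=float)
    return f / (np.linalg.norm(f) + EPS)


def root_normalize(f):
    """Element-wise square root of the l1-normalized vector.

    Assumes non-negative entries, which holds for softmax and ReLU outputs.
    """
    return np.sqrt(l1_normalize(f))
```

Note that a root-normalized vector with non-negative entries has unit
$\ell_2$ norm, since its squared entries sum to one.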

Choice of Classifier. We applied SVM classifiers to the video-level
features, including linear SVMs and non-linear SVMs with the Gaussian
radial basis function (RBF) kernel $\exp\{-\gamma \|\mathbf{x} -
\mathbf{y}\|^2\}$ and the exponential $\chi^2$ kernel $\exp\{-\gamma
\sum_i \frac{(x_i - y_i)^2}{x_i + y_i}\}$. Principal component analysis
(PCA) can be applied prior to the SVM to reduce the feature dimension.
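To make the two non-linear kernels concrete, the sketch below computes
their Gram matrices with NumPy. The function names and the `eps` guard
for bins where both inputs are zero are assumptions of this example; the
text does not state which SVM implementation was used.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2) for all pairs."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)


def exp_chi2_kernel(X, Y, gamma, eps=1e-12):
    """Exponential chi-square kernel for non-negative features.

    exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)); `eps` avoids 0 / 0.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    diff = X[:, None, :] - Y[None, :, :]
    total = X[:, None, :] + Y[None, :, :]
    chi2 = (diff ** 2 / (total + eps)).sum(axis=-1)
    return np.exp(-gamma * chi2)
```

A Gram matrix computed this way can be handed to any SVM solver that
accepts precomputed kernels.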

Figure 2. Distribution of values for the output layer and the last two
hidden layers of the CNN architecture. Left: output. Middle: hidden6.
Right: hidden7.

Figure 3. Pooling. (a) The 10 overlapping square patches extracted from
each frame for spatial pyramid pooling. The red patch (centered at
location 10) has sides equal to the frame height, whereas the other
patches (centered at locations 1-9) have sides equal to half of the
frame height. (b) Spatial pyramids (SP). Left: the 1 × 1 spatial
partition includes the entire frame. Middle: the 3 × 1 spatial
partitions include the top, middle and bottom of the frame. Right: the
2 × 2 spatial partitions include the upper-left, upper-right,
bottom-left and bottom-right of the frame. (c) Objectness-based pooling.
Left: objectness proposals. Middle: sum of 1000 objectness proposals.
Right: foreground patch from thresholding proposal sums.

4. Modality Fusion
4.1. Fisher Vectors

An effective approach to video classification is to extract multiple
low-level feature descriptors and then encode them as a fixed-length
video-level Fisher vector (FV). The FV is a generalization of the
bag-of-words approach that encodes the zero-order, first-order and
second-order statistics of the descriptor distribution. The FV encoding
procedure is summarized as follows [26]. First, learn a Gaussian mixture
model (GMM) on low-level descriptors extracted from a generic set of
unlabeled videos. Second, compute the gradient of the log-likelihood of
the GMM with respect to the GMM parameters (known as the score
function). The gradient with respect to the mixture weights encodes the
zero-order statistics. The gradient with respect to the Gaussian means
encodes the first-order statistics, while the gradient with respect to
the Gaussian variances encodes the second-order statistics. Third,
concatenate these gradients into a single vector and apply a signed
square root to each FV dimension (power normalization), followed by a
global $\ell_2$ normalization. As low-level features, we considered both
the standard D-SIFT descriptors [22] and the more sophisticated
motion-based improved dense trajectories (IDT) [36]. For the SIFT
descriptors, we opted for multiscale (5 scales) and dense (stride of 4
pixels in both spatial dimensions) sampling, root normalization and
spatiotemporal pyramid pooling. For the IDT descriptors, we concatenated
the HOG [5], HOF [6] and MBH [18] descriptors extracted along the
estimated motion trajectory.
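The final step of the FV encoding, power normalization followed by a
global $\ell_2$ normalization, can be sketched as below. The function
name and the `eps` guard are assumptions of this example, and the GMM
fitting and gradient computation that produce the raw vector are not
shown.

```python
import numpy as np


def normalize_fisher_vector(fv, eps=1e-12):
    """Apply a signed square root, then a global l2 normalization."""
    fv = np.asarray(fv, dtype=float)
    # Power normalization: keep the sign, take the root of the magnitude.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    # Global l2 normalization; `eps` guards against an all-zero vector.
    return fv / (np.linalg.norm(fv) + eps)
```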

4.2. Fusion 

We investigated modality fusion to fully exploit both the CNN features
and the strong hand-engineered features. A weighted average fusion was
applied: first, the SVM margins were converted into posterior
probability scores using Platt’s method [27]; then, the weights of the
linear combination of the feature scores were optimized for each class
using cross-validation. We evaluated the fusion of scores from different
CNN layers and from FVs.
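The two stages of this late fusion can be illustrated as follows. This
is a sketch under assumptions: the sigmoid parameters `a` and `b` are
taken as given (in Platt's method they are fitted on held-out margins),
the fusion weights are passed in rather than cross-validated, and both
function names are introduced only for this example.

```python
import numpy as np


def platt_probability(margin, a, b):
    """Platt's sigmoid: p = 1 / (1 + exp(a * margin + b)).

    `a` is typically negative, so larger margins give higher probabilities.
    """
    margin = np.asarray(margin, dtype=float)
    return 1.0 / (1.0 + np.exp(a * margin + b))


def fuse_scores(prob_per_feature, weights):
    """Weighted average of per-feature probabilities.

    prob_per_feature: (F features, N videos); weights: (F,), renormalized
    here so that they sum to one.
    """
    probs = np.asarray(prob_per_feature, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ probs
```

For example, fusing the scores [0.2, 0.8] and [0.6, 0.4] with equal
weights gives [0.4, 0.6]; with weights 3 and 1 it gives [0.3, 0.7].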

5. Evaluation in Event Detection 

5.1. Video Dataset and Performance Metric 

We conducted extensive experiments on the TRECVID multimedia event
detection (MED)’14 video dataset [25]. This dataset consists of: (a) a
training set of 4,992 unlabeled background videos used as negative
examples; (b) a training set of 2,991 positive and near-miss videos,
including 100 positive videos and about 50 near-miss videos (treated as
negative examples) for each of the 20 pre-specified events (see Table
1); and (c) a test set of 23,953 videos containing positive and negative
instances of the pre-specified events. Some sample frames are given in
Figure 4. Contrary to other popular video datasets, such as UCF-101
[31], the MED’14 dataset is not constrained to any class of videos. It
consists of a heterogeneous set of temporally untrimmed YouTube-like
videos of various resolutions, quality, camera motions, and illumination
conditions. This dataset is thus one of the largest and most challenging
datasets for video event detection. As a retrieval performance metric,
we considered the one used in the official MED’14 task, i.e., mean
average precision (mAP) across all events. Let $E$ denote the number of
events and $P_e$ the number of positive instances of event $e$; then mAP
is computed as

$$\mathrm{mAP} = \frac{1}{E} \sum_{e=1}^{E} \mathrm{AP}(e), \qquad (1)$$

where the average precision of an event $e$ is defined as

$$\mathrm{AP}(e) = \frac{1}{P_e} \sum_{tp=1}^{P_e} \frac{tp}{\mathrm{rank}(tp)},$$

with $\mathrm{rank}(tp)$ the position of the $tp$-th true positive in
the ranked list of test videos. mAP is thus normalized between 0 (low
classification performance) and 1 (high classification performance). In
this paper, we report it as a percentage.
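As a worked illustration of the metric, the sketch below computes the
average precision of one event as the mean, over its positive videos, of
tp / rank(tp), where rank(tp) is the 1-based position of the tp-th true
positive in the list sorted by decreasing score. It then averages over
events. The function names are introduced for this example, and this is
not the official TRECVID scoring tool.

```python
import numpy as np


def average_precision(scores, labels):
    """AP of one event: mean over positives of tp / rank(tp)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Sort videos by decreasing detection score.
    order = np.argsort(-scores, kind="stable")
    # 1-based ranks at which the true positives appear.
    ranks = np.flatnonzero(labels[order]) + 1
    if ranks.size == 0:
        return 0.0
    tp = np.arange(1, ranks.size + 1)
    return float(np.mean(tp / ranks))


def mean_average_precision(score_lists, label_lists):
    """mAP: average of the per-event APs."""
    aps = [average_precision(s, l) for s, l in zip(score_lists, label_lists)]
    return float(np.mean(aps))
```

For scores (0.9, 0.8, 0.7, 0.6) with labels (1, 0, 1, 0), the positives
appear at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.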






























Figure 4. Sample frames from the TRECVID MED’14 dataset. Events shown:
attempting a bike trick, dog show, marriage proposal, renovating a home,
town hall meeting, winning a race without a vehicle, non-motorized
vehicle repair, fixing a musical instrument, horse riding competition,
and parking a vehicle.

Events E021-E030                     Events E031-E040
-----------------------------------  -----------------------------
Attempting a bike trick              Beekeeping
Cleaning an appliance                Wedding shower
Dog show                             Non-motorized vehicle repair
Giving directions                    Fixing a musical instrument
Marriage proposal                    Horse riding competition
Renovating a home                    Felling a tree
Rock climbing                        Parking a vehicle
Town hall meeting                    Playing fetch
Winning a race w/o a vehicle         Tailgating
Working on a metal crafts project    Tuning a musical instrument

Table 1. TRECVID 2014 pre-specified events.


5.2. Single Feature Performance 

The single-feature performance for various configurations is reported in
Tables 2 and 3. The following are a few salient observations.

CNN architectures and layers. The deeper CNN architecture yields
consistently better performance, which results from its depth and the
small receptive fields in all convolutional layers. We also observed
that both hidden layers outperformed the output layer when the CNN
architecture, normalization and spatiotemporal pooling strategies are
the same.

Pooling, spatial pyramids and objectness. We observed a consistent gain
for max pooling over average pooling for both spatial and temporal
pooling, irrespective of the CNN layer used. This mainly results from
the highly heterogeneous structure of the video dataset. Many videos
contain frames that can be considered irrelevant, or at least less
relevant than others, e.g., introductory text and black frames. Hence it
is beneficial to use the maximum feature response instead of giving an
equal weight to all features. As observed in Table 2, concatenating the
8 spatial partitions (SP8) gives the best performance for all CNN layers
and SVM choices (up to 6% mAP gain over no spatial pooling, i.e.,
“none”), but at the expense of an increased feature dimensionality and,
consequently, increased training and testing time. Alternatively,
objectness-guided pooling provides a good trade-off between performance
and dimensionality, as shown in Table 3. It outperforms the baseline
approach without spatial expansion (“none” in Table 2). Its feature
dimension is only one-fourth of that of SP8, while its performance
nearly matches that of SP8 when using a kernel SVM.

Normalization. We observed that $\ell_1$ normalization did not perform
well, while $\ell_2$ normalization is essential to achieve good
performance with the hidden layers. Applying root normalization to the
output layer yields essentially the same result as applying $\ell_2$
normalization. However, we noticed a drop in performance when root
normalization was applied to the hidden layers.

Classifier. One SVM model was trained for each event following the
TRECVID MED’14 training rules, i.e., excluding the positive and
near-miss examples of the other events. We observed that kernel SVMs
consistently outperformed linear SVMs regardless of the CNN layer or
normalization. For the output layer, which essentially behaves like a
histogram-based feature, the $\chi^2$ kernel yields essentially the same
result as the RBF kernel. For the hidden layers, the best performance
was obtained with an RBF kernel. As shown in Table 2, the margin of
kernel SVMs over linear SVMs is large.

Layer     Dim.     SP     Norm       SVM        mAP (CNN A)   mAP (CNN B)
--------  -------  -----  ---------  ---------  ------------  ------------
output    1,000    none   root       linear     15.90%        19.46%
output    8,000    SP8    root       linear     22.04%        25.67%
output    8,000    SP8    $\ell_2$   linear     22.01%        25.88%
output    1,000    none   root       $\chi^2$   21.54%        25.30%
output    8,000    SP8    root       $\chi^2$   27.22%        31.24%
output    8,000    SP8    root       RBF        27.12%        31.73%
hidden6   4,096    none   $\ell_2$   linear     22.11%        25.37%
hidden6   32,768   SP8    root       linear     21.13%        26.27%
hidden6   32,768   SP8    $\ell_2$   linear     23.21%        28.31%
hidden6   4,096    none   $\ell_2$   RBF        28.02%        32.93%
hidden6   32,768   SP8    $\ell_2$   RBF        28.20%        33.54%
hidden7   4,096    none   $\ell_2$   linear     21.45%        25.08%
hidden7   32,768   SP8    root       linear     23.72%        26.57%
hidden7   32,768   SP8    $\ell_2$   linear     25.01%        29.70%
hidden7   4,096    none   $\ell_2$   RBF        27.53%        33.72%
hidden7   32,768   SP8    $\ell_2$   RBF        29.41%        34.95%

Table 2. Various configurations of video classification. Spatiotemporal
pooling: max pooling. A and B: 8- and 19-layer CNNs from [15] and [30],
resp. SP8: the 1 × 1 + 3 × 1 + 2 × 2 spatial pyramids. Dataset: TRECVID
MED’14 100Ex.

Layer     Dim.    Norm       SVM        mAP
--------  ------  ---------  ---------  -------
output    2,000   root       linear     23.62%
output    2,000   root       $\chi^2$   29.26%
hidden6   8,192   $\ell_2$   linear     26.69%
hidden6   8,192   $\ell_2$   RBF        33.33%
hidden7   8,192   $\ell_2$   linear     27.41%
hidden7   8,192   $\ell_2$   RBF        34.51%

Table 3. Objectness-guided pooling; CNN B; MED’14 100Ex.

Dim.     mAP (linear SVM)   mAP (RBF SVM)
-------  -----------------  --------------
32,768   29.70%             34.95%
4,096    29.69%             34.55%
2,048    29.02%             34.00%
1,024    28.34%             33.18%

Table 4. Applying PCA to the top-performing feature in Table 2 prior to
the SVM. Feature: hidden7-layer feature extracted from CNN architecture
B with SP8 and $\ell_2$ normalization.

PCA. We analyzed dimensionality reduction of the top-performing feature,
CNN-B-hidden7-SP8, as shown in Table 4. Applying PCA prior to the SVM
reduced the feature dimension from 32,768 to 4,096, 2,048 and 1,024,
which resulted in slightly reduced mAPs compared with classifying the
features without PCA. However, it still outperformed or matched the
performance of the features without spatial pyramids
(CNN-B-hidden7-none in Table 2). Thus the spatial pyramid is effective
in capturing useful information, and the feature dimension can be
reduced without loss of important information.
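The PCA step can be sketched with a singular value decomposition. This
is a minimal illustration, assuming the projection is learned on the
training features and then applied to all videos; the function names
are introduced for this example, and any whitening or re-normalization
after projection is not specified in the text and is omitted here.

```python
import numpy as np


def fit_pca(X, n_components):
    """Learn a PCA projection from the rows of X (N samples, D dims)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    """Project features onto the learned principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ components.T
```

Note that with N training videos the centered data has rank at most
N - 1, so at most N - 1 informative components are available.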

5.3. Fusion Performance 

We evaluated the fusion of CNN features and Fisher vectors (FVs). The
mAPs (and feature dimensions) obtained with D-SIFT+FV and IDT+FV using
RBF SVM classifiers were 24.84% and 28.45% (98,304 and 101,376),
respectively. Table 5 reports our various fusion experiments. As
expected, the late fusion of the static (D-SIFT) and motion-based (IDT)
FVs brings a substantial improvement over the results obtained with the
motion-based FV alone. Fusing CNN features from different layers does
not provide much gain. This can be explained by the similarity of the
information captured by the hidden layers and the output layer.
Similarly, fusing a hidden layer with the static FV does not provide any
improvement, although the output layer can still benefit from late
fusion with the static FV feature. However, fusion between any of the
CNN-based features and the motion-based FV brings a consistent gain over
the single best performer. This indicates that appropriate integration
of motion information into the CNN architecture leads to substantial
improvements.


Features                      mAP      vs. single feat.
----------------------------  -------  ------------------
D-SIFT+FV, IDT+FV             33.09%   +8.25%, +4.64%
CNN-output, CNN-hidden6       35.04%   +3.80%, +1.50%
CNN-output, CNN-hidden7       34.92%   +3.68%, -0.03%
CNN-hidden6, CNN-hidden7      34.85%   +1.31%, -0.10%
CNN-output, D-SIFT+FV         31.45%   +0.21%, +6.61%
CNN-hidden6, D-SIFT+FV        31.71%   -1.83%, +6.87%
CNN-hidden7, D-SIFT+FV        33.21%   -1.74%, +8.37%
CNN-output, IDT+FV            37.97%   +6.73%, +9.52%
CNN-hidden6, IDT+FV           38.30%   +4.76%, +9.85%
CNN-hidden7, IDT+FV           38.74%   +3.79%, +10.29%

Table 5. Fusion. Used the best-performing CNN features with kernel SVMs
in Table 2.


5.4. Comparison with the State-of-the-Art 

Table 6 compares the proposed approach with strong hand-engineered
approaches and another CNN-based approach on TRECVID MED’14. The
proposed approach based on CNN features significantly outperformed the
strong hand-engineered approaches (D-SIFT+FV, IDT+FV and MIFS), even
without integrating motion information. The CNN-based features have a
lower dimensionality than the Fisher vectors. In particular, the
dimension of the output layer is an order of magnitude smaller. The CNN
architecture was trained on high-resolution images, whereas we applied
it to low-resolution video frames that suffer from compression artifacts
and motion blur. This further confirms that the CNN features are very
robust, despite the domain mismatch. The proposed approach also
outperforms the CNN-based approach by Xu et al. [39]. Developed
independently, their approach uses the same CNN architecture as ours,
but different CNN layers, pooling, feature encoding and fusion
strategies. The proposed approach outperforms all these competitive
approaches and yields a new state of the art, thanks to the carefully
designed CNN-based system and the fusion with motion-based features.
Method                                      CNN   mAP
------------------------------------------  ----  -------
D-SIFT [22]+FV [26]                         no    24.84%
IDT [36]+FV [26]                            no    28.45%
D-SIFT+FV, IDT+FV (fusion)                  no    33.09%
MIFS [16]                                   no    29.0%
CNN-LCDvlad with multi-layer fusion [39]    yes   36.8%
proposed: CNN-hidden6                       yes   33.54%
proposed: CNN-hidden7                       yes   34.95%
proposed: CNN-hidden7, IDT+FV               yes   38.74%

Table 6. Comparison with other approaches on TRECVID MED’14 100Ex.


CNN archi.   Features   split 1 acc.   split 2 acc.   split 3 acc.   mean acc.
-----------  ---------  -------------  -------------  -------------  ----------
A            output     69.02%         68.21%         68.45%         68.56%
A            hidden6    70.76%         71.21%         72.05%         71.34%
A            hidden7    72.35%         71.64%         73.19%         72.39%
B            output     75.65%         75.74%         76.00%         75.80%
B            hidden6    79.88%         79.14%         79.00%         79.34%
B            hidden7    79.01%         79.30%         78.73%         79.01%

Table 7. Accuracy of single features using the proposed approach on
UCF-101. Configurations are the same as in the experiments on TRECVID
MED’14. Features are extracted with SP8. All features are classified
with linear SVMs.

6. Evaluation in Action Recognition 

6.1. Single Feature Performance 

We evaluated our approach on UCF-101 [31], a well-established action
recognition dataset. We followed the three standard splits and report
the overall accuracy on each split and the mean accuracy across the
three splits. The configuration was the same as the one used in the
TRECVID MED’14 experiments, except that only linear SVMs were used, for
a fair comparison with other approaches on this dataset. The results are
given in Table 7. The CNN architectures from [15] and [30] are referred
to as A and B. The results show that using hidden-layer features yields
better recognition accuracy than using the softmax activations at the
output layer. The performance also improves by up to 8% when using the
deeper CNN architecture.

6.2. Fusion Performance 

We extracted IDT+FV features on this dataset and obtained a mean
accuracy of 86.5% across the three splits. Unlike the event detection
task on the TRECVID MED’14 dataset, action recognition on UCF-101 uses
temporally trimmed videos and is more centered on motion. Thus, the
motion-based IDT+FV approach outperformed the image-based CNN approach.
However, as shown in Table 8, a simple weighted average fusion of
CNN-hidden6 and IDT+FV features boosts the performance to the best
accuracy of 89.62%.


CNN archi.   CNN layer   split 1 acc.   split 2 acc.   split 3 acc.   mean acc.
-----------  ----------  -------------  -------------  -------------  ----------
A            output      85.57%         87.01%         87.18%         86.59%
A            hidden6     86.39%         87.36%         87.61%         87.12%
A            hidden7     86.15%         87.98%         87.74%         87.29%
B            output      87.95%         88.43%         89.37%         88.58%
B            hidden6     88.63%         90.01%         90.21%         89.62%
B            hidden7     88.50%         89.29%         90.10%         89.30%

Table 8. Accuracy of late fusion of CNN features and IDT+FV features.
The configurations are the same as in Table 7. We extracted IDT+FV
features based on [36, 26].


Method                                          acc.
----------------------------------------------  -------
Spatial stream ConvNet [29]                     73.0%
Temporal stream ConvNet [29]                    83.7%
Two-stream ConvNet fusion by avg [29]           86.9%
Two-stream ConvNet fusion by SVM [29]           88.0%
Slow-fusion spatiotemporal ConvNet [14]         65.4%
Single-frame model [23]                         73.3%
LSTM (image + optical flow) [23]                88.6%
proposed: CNN-hidden6 only                      79.34%
proposed: CNN-hidden6, IDT+FV (avg. fusion)     89.62%
proposed: CNN-hidden7, IDT+FV (avg. fusion)     89.30%

Table 9. Comparison with other approaches based on neural networks, in
mean accuracy over three splits on UCF-101.

6.3. Comparison with the State-of-the-Art 

Table 9 shows comparisons with other approaches based on neural
networks. Note that the proposed image-based CNN approach yields higher
accuracy (by 6.3%, 6.0% and 13.9%, respectively) than the spatial stream
ConvNet in [29], the single-frame model in [23] and the slow-fusion
spatiotemporal ConvNet in [14], even though our model is not fine-tuned
on the specific dataset. The fusion of CNN-hidden6 (or CNN-hidden7) and
IDT+FV features also outperforms the two-stream CNN approach [29] and
the long short-term memory (LSTM) approach that uses both image and
optical flow [23].

7. Computational Cost 

We have benchmarked the extraction time of the Fisher vectors and CNN
features on a CPU machine. Extracting the D-SIFT (resp. IDT) Fisher
vector takes about 0.4 (resp. 5) times the video playback time, while
extracting the CNN features requires 0.4 times the video playback time.
The CNN features can thus be extracted in real time. On the classifier
training side, it takes about 150s to train a kernel SVM event detector
using the Fisher vectors, while it takes around 90s with the CNN
features using the same training pipeline. On the testing side, it takes
around 30s to apply a Fisher vector trained event model to the 23,953
TRECVID MED’14 videos, while it takes about 15s to apply a CNN trained
event model to the same set of videos.

8. Conclusion 

In this paper we proposed a step-by-step procedure to fully exploit the
potential of image-trained CNN architectures for video classification.
While every step of our procedure has an impact on the final
classification performance, we showed that the CNN architecture, the
choice of CNN layer, the spatiotemporal pooling, the normalization, and
the choice of classifier are the most sensitive factors. Using the
proposed procedure, we showed that an image-trained CNN architecture can
outperform competitive motion- and spatiotemporal-based non-CNN
approaches on the challenging TRECVID MED’14 video dataset. The results
show that improvements to the image-trained CNN architecture are also
beneficial to video classification, despite the domain mismatch.
Moreover, we demonstrated that adding some motion information via late
fusion brings substantial gains, outperforming other vision-based
approaches on the MED’14 dataset. Finally, we compared the proposed
approach with other neural network approaches on the action recognition
dataset UCF-101. The image-trained CNN approach is comparable with the
state of the art, and the late fusion of image-trained CNN features and
motion-based IDT+FV features outperforms the state of the art.

In this work we used an image-trained CNN as a black-box feature
extractor. Therefore, we expect any improvements in the CNN to directly
lead to improvements in video classification as well. The CNN was
trained on the ImageNet dataset, which mostly contains high-resolution
photographic images, whereas the video dataset is fairly heterogeneous
in terms of quality, resolution, compression artifacts and camera
motion. Due to this domain mismatch, we believe that additional gains
can be achieved by fine-tuning the CNN on the video dataset. Further
improvements can possibly be made by learning motion information through
a spatiotemporal deep neural network architecture.